[Day09] - Datatype: Three Container Types

2025 iThome Ironman Contest

DAY 9

Software Development

Part 9 of the Polars熊霸天下 series

17th Ironman Contest python polars

Jerry Wu

2025-09-15 08:32:32

Today we look at three container types: pl.Array, pl.List, and pl.Struct.

Today's outline:

Imports and setup
pl.Array
pl.List
pl.Struct
Summary
codepanda

1. `pl.Array`

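pl.Array resembles a Python tuple. Compared with pl.List, it uses less memory and performs better, but its arr namespace offers fewer expressions. It is a good fit when every row holds a fixed number of elements.

Below we build a DataFrame named df1 that simulates three players, each with three recorded dice rolls. Its "numbers" column has the pl.Array type. Polars does not infer pl.Array automatically, so we have to declare the inner type (all elements must share one type) and the shape up front. Here each array in the "numbers" column holds pl.UInt64 values, three per array.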

import polars as pl

df1 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "numbers": [[5, 15, 3], [11, 14, 6], [13, 18, 12]],
    },
    schema_overrides={"id": pl.UInt64, "numbers": pl.Array(pl.UInt64, 3)},
)

shape: (3, 2)
┌─────┬───────────────┐
│ id  ┆ numbers       │
│ --- ┆ ---           │
│ u64 ┆ array[u64, 3] │
╞═════╪═══════════════╡
│ 1   ┆ [5, 15, 3]    │
│ 2   ┆ [11, 14, 6]   │
│ 3   ┆ [13, 18, 12]  │
└─────┴───────────────┘

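Below we demonstrate three expressions from the arr namespace:

pl.Expr.arr.first() returns the first element of the array.
pl.Expr.arr.last() returns the last element of the array.
pl.Expr.arr.get() returns the element at a given index, here index 1.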

df1.with_columns(
    pl.col("numbers").arr.first().alias("first"),
    pl.col("numbers").arr.last().alias("last"),
    pl.col("numbers").arr.get(1).alias("get_1"),
)

shape: (3, 5)
┌─────┬───────────────┬───────┬──────┬───────┐
│ id  ┆ numbers       ┆ first ┆ last ┆ get_1 │
│ --- ┆ ---           ┆ ---   ┆ ---  ┆ ---   │
│ u64 ┆ array[u64, 3] ┆ u64   ┆ u64  ┆ u64   │
╞═════╪═══════════════╪═══════╪══════╪═══════╡
│ 1   ┆ [5, 15, 3]    ┆ 5     ┆ 3    ┆ 15    │
│ 2   ┆ [11, 14, 6]   ┆ 11    ┆ 6    ┆ 14    │
│ 3   ┆ [13, 18, 12]  ┆ 13    ┆ 12   ┆ 18    │
└─────┴───────────────┴───────┴──────┴───────┘

2. `pl.List`

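pl.List resembles a Python list. Its list namespace offers a richer set of expressions than the arr namespace. Among them, pl.Expr.list.eval() applies an expression to every element inside a pl.List, which makes it a killer feature. The example below walks through how it is used.

First we build a DataFrame named df2 that simulates three players, each with three recorded dice rolls. Its "numbers" column is a pl.String column whose values are separated by spaces; we will split it into a pl.List in a later step.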

df2 = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "numbers": [
            "5 15 1",
            "None 14 6",
            "13 18 19",
        ],
    },
    schema_overrides={"id": pl.UInt64},
)

shape: (3, 2)
┌─────┬───────────┐
│ id  ┆ numbers   │
│ --- ┆ ---       │
│ u64 ┆ str       │
╞═════╪═══════════╡
│ 1   ┆ 5 15 1    │
│ 2   ┆ None 14 6 │
│ 3   ┆ 13 18 19  │
└─────┴───────────┘

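Sharp-eyed readers may spot a problem in this data: each recorded value is expected to fall between 3 and 18.

The first player's last roll ("1"), the second player's first roll ("None"), and the third player's last roll ("19") are all anomalies. We will use pl.List to find each player's anomalous values and the roll in which each one occurred. This is the result we are ultimately after: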

(
    df2.with_columns(
        pl.col("numbers")
        .str.split(" ")
        .list.eval(pl.element().cast(pl.UInt64, strict=False))
        .alias("list")
    )
    .with_columns(
        pl.col("list")
        .list.eval(
            pl.element()
            .gt(18)
            .or_(pl.element().lt(3))
            .or_(pl.element().is_null())
        )
        .list.eval(pl.element().arg_true())
        .alias("outlier_indexes")
    )
    .with_columns(
        pl.col("list")
        .list.gather(pl.col("outlier_indexes"))
        .alias("outliers")
    )
)

shape: (3, 5)
┌─────┬───────────┬───────────────┬─────────────────┬───────────┐
│ id  ┆ numbers   ┆ list          ┆ outlier_indexes ┆ outliers  │
│ --- ┆ ---       ┆ ---           ┆ ---             ┆ ---       │
│ u64 ┆ str       ┆ list[u64]     ┆ list[u32]       ┆ list[u64] │
╞═════╪═══════════╪═══════════════╪═════════════════╪═══════════╡
│ 1   ┆ 5 15 1    ┆ [5, 15, 1]    ┆ [2]             ┆ [1]       │
│ 2   ┆ None 14 6 ┆ [null, 14, 6] ┆ [0]             ┆ [null]    │
│ 3   ┆ 13 18 19  ┆ [13, 18, 19]  ┆ [2]             ┆ [19]      │
└─────┴───────────┴───────────────┴─────────────────┴───────────┘

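The steps are explained one at a time below.

The first `df.with_columns()`

pl.Expr.str.split() splits the "numbers" column on spaces, which produces a pl.List. Next, pl.Expr.list.eval() together with pl.element() casts every element to pl.UInt64. The cast can fail, so strict= is set to False, which turns values that cannot be parsed into null. Finally, the column is named "list".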

(
    df2.with_columns(
        pl.col("numbers")
        .str.split(" ")
        .list.eval(pl.element().cast(pl.UInt64, strict=False))
        .alias("list")
    )
)

shape: (3, 3)
┌─────┬───────────┬───────────────┐
│ id  ┆ numbers   ┆ list          │
│ --- ┆ ---       ┆ ---           │
│ u64 ┆ str       ┆ list[u64]     │
╞═════╪═══════════╪═══════════════╡
│ 1   ┆ 5 15 1    ┆ [5, 15, 1]    │
│ 2   ┆ None 14 6 ┆ [null, 14, 6] │
│ 3   ┆ 13 18 19  ┆ [13, 18, 19]  │
└─────┴───────────┴───────────────┘

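If this is still hard to picture, think of pl.Expr.list.eval() combined with pl.element() as the Polars counterpart of looping over a Python list, transforming each element, and collecting the results into a new list.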
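The analogy can be sketched in plain Python. The helper `to_uint_or_none` is hypothetical and only imitates, for a single element, what a non-strict cast to pl.UInt64 does to the strings in this example:

```python
from typing import Optional


def to_uint_or_none(text: str) -> Optional[int]:
    """Hypothetical helper: return an int for digit-only strings, else None."""
    return int(text) if text.isdigit() else None


# One row of the "numbers" column after splitting on spaces.
row = "None 14 6".split(" ")

# Visit every element, transform it, and collect the results into a new list.
converted = [to_uint_or_none(item) for item in row]
print(converted)  # [None, 14, 6]
```

Polars runs the equivalent logic inside its own engine for every row, so no Python-level loop is involved.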

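The second `df.with_columns()`

We apply pl.Expr.list.eval() to the "list" column twice:

The first call flags anomalies with a boolean (a value less than 3, greater than 18, or null).
The second call uses pl.Expr.arg_true() to find the indexes inside each pl.List where the boolean is True.

Finally, the column is named "outlier_indexes".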

(
    df2
    # ... (the first with_columns() step shown earlier goes here)
    .with_columns(
        pl.col("list")
        .list.eval(
            pl.element()
            .gt(18)
            .or_(pl.element().lt(3))
            .or_(pl.element().is_null())
        )
        .list.eval(pl.element().arg_true())
        .alias("outlier_indexes")
    )
)

shape: (3, 4)
┌─────┬───────────┬───────────────┬─────────────────┐
│ id  ┆ numbers   ┆ list          ┆ outlier_indexes │
│ --- ┆ ---       ┆ ---           ┆ ---             │
│ u64 ┆ str       ┆ list[u64]     ┆ list[u32]       │
╞═════╪═══════════╪═══════════════╪═════════════════╡
│ 1   ┆ 5 15 1    ┆ [5, 15, 1]    ┆ [2]             │
│ 2   ┆ None 14 6 ┆ [null, 14, 6] ┆ [0]             │
│ 3   ┆ 13 18 19  ┆ [13, 18, 19]  ┆ [2]             │
└─────┴───────────┴───────────────┴─────────────────┘

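The third `df.with_columns()`

We call pl.Expr.list.gather() on the "list" column and pass the "outlier_indexes" column as the indexes. Note that pl.Expr.list.gather() accepts a pl.List of indexes, so it can fetch several elements at once; in this example each of the three players simply happens to have exactly one anomaly. Finally, the column is named "outliers".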

(
    df2
    # ... (the first two with_columns() steps shown earlier go here)
    .with_columns(
        pl.col("list")
        .list.gather(pl.col("outlier_indexes"))
        .alias("outliers")
    )
)

shape: (3, 5)
┌─────┬───────────┬───────────────┬─────────────────┬───────────┐
│ id  ┆ numbers   ┆ list          ┆ outlier_indexes ┆ outliers  │
│ --- ┆ ---       ┆ ---           ┆ ---             ┆ ---       │
│ u64 ┆ str       ┆ list[u64]     ┆ list[u32]       ┆ list[u64] │
╞═════╪═══════════╪═══════════════╪═════════════════╪═══════════╡
│ 1   ┆ 5 15 1    ┆ [5, 15, 1]    ┆ [2]             ┆ [1]       │
│ 2   ┆ None 14 6 ┆ [null, 14, 6] ┆ [0]             ┆ [null]    │
│ 3   ┆ 13 18 19  ┆ [13, 18, 19]  ┆ [2]             ┆ [19]      │
└─────┴───────────┴───────────────┴─────────────────┴───────────┘

3. `pl.Struct`

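pl.Struct resembles a Python dictionary (more precisely, a typing.TypedDict).

Below we build a DataFrame named df3 that simulates three players, each with three recorded dice rolls. Its "numbers" column has the pl.Struct type, and every field inside it is pl.UInt64: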

df3 = pl.DataFrame(
    {
        "numbers": [
            {"first": 5, "second": 15, "third": 15},
            {"first": 5, "second": 14, "third": 6},
            {"first": 13, "second": 18, "third": 5},
        ]
    },
    schema={
        "numbers": pl.Struct(
            {"first": pl.UInt64, "second": pl.UInt64, "third": pl.UInt64}
        )
    },
)

shape: (3, 1)
┌───────────┐
│ numbers   │
│ ---       │
│ struct[3] │
╞═══════════╡
│ {5,15,15} │
│ {5,14,6}  │
│ {13,18,5} │
└───────────┘

pl.Expr.struct.unnest() splits the fields of a pl.Struct into separate columns:

df3.select(pl.col("numbers").struct.unnest())

shape: (3, 3)
┌───────┬────────┬───────┐
│ first ┆ second ┆ third │
│ ---   ┆ ---    ┆ ---   │
│ u64   ┆ u64    ┆ u64   │
╞═══════╪════════╪═══════╡
│ 5     ┆ 15     ┆ 15    │
│ 5     ┆ 14     ┆ 6     │
│ 13    ┆ 18     ┆ 5     │
└───────┴────────┴───────┘

pl.Expr.struct.field() extracts one or more fields as standalone columns. For example:

df3.select(pl.col("numbers").struct.field("first"))

shape: (3, 1)
┌───────┐
│ first │
│ ---   │
│ u64   │
╞═══════╡
│ 5     │
│ 5     │
│ 13    │
└───────┘

Passing "*" has the same effect as pl.Expr.struct.unnest():

df3.select(pl.col("numbers").struct.field("*"))

shape: (3, 3)
┌───────┬────────┬───────┐
│ first ┆ second ┆ third │
│ ---   ┆ ---    ┆ ---   │
│ u64   ┆ u64    ┆ u64   │
╞═══════╪════════╪═══════╡
│ 5     ┆ 15     ┆ 15    │
│ 5     ┆ 14     ┆ 6     │
│ 13    ┆ 18     ┆ 5     │
└───────┴────────┴───────┘

pl.Expr.value_counts() counts how often each result appears across the three players' three rolls (nine rolls in total):

(
    df3.select(
        pl.col("numbers")
        .struct.field("first")
        .append(pl.col("numbers").struct.field("second"))
        .append(pl.col("numbers").struct.field("third"))
        .value_counts(sort=True)
        .alias("counts")
    )
)

shape: (6, 1)
┌───────────┐
│ counts    │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {5,3}     │
│ {15,2}    │
│ {13,1}    │
│ {14,1}    │
│ {18,1}    │
│ {6,1}     │
└───────────┘

To build a pl.Struct column dynamically from existing columns, use pl.struct():

df4 = pl.DataFrame(
    {"first": [5, 5, 13], "second": [15, 14, 18], "third": [15, 6, 5]},
    schema={"first": pl.UInt64, "second": pl.UInt64, "third": pl.UInt64},
)

(
    df4.with_columns(
        pl.struct("first", "second", "third").alias("combined")
    )
)

shape: (3, 4)
┌───────┬────────┬───────┬───────────┐
│ first ┆ second ┆ third ┆ combined  │
│ ---   ┆ ---    ┆ ---   ┆ ---       │
│ u64   ┆ u64    ┆ u64   ┆ struct[3] │
╞═══════╪════════╪═══════╪═══════════╡
│ 5     ┆ 15     ┆ 15    ┆ {5,15,15} │
│ 5     ┆ 14     ┆ 6     ┆ {5,14,6}  │
│ 13    ┆ 18     ┆ 5     ┆ {13,18,5} │
└───────┴────────┴───────┴───────────┘

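Finally, a more advanced use case arises when a computation needs information from several columns at once, for example when a function has to compute the sum of each row: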

(
    df4.with_columns(
        pl.struct("first", "second", "third")
        .alias("combined")
        .map_batches(
            lambda x: x.struct.field("first")
            + x.struct.field("second")
            + x.struct.field("third"),
            return_dtype=pl.UInt64,
        )
        .alias("sum")
    )
)

shape: (3, 4)
┌───────┬────────┬───────┬─────┐
│ first ┆ second ┆ third ┆ sum │
│ ---   ┆ ---    ┆ ---   ┆ --- │
│ u64   ┆ u64    ┆ u64   ┆ u64 │
╞═══════╪════════╪═══════╪═════╡
│ 5     ┆ 15     ┆ 15    ┆ 35  │
│ 5     ┆ 14     ┆ 6     ┆ 25  │
│ 13    ┆ 18     ┆ 5     ┆ 36  │
└───────┴────────┴───────┴─────┘

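Here pl.Expr.map_batches() is the hook Polars provides for plugging in a function; that function must be able to operate on a whole column at once. To plug in a function that operates on individual elements, use pl.Expr.map_elements() instead.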

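pl.Expr.map_batches() and pl.Expr.map_elements() are comparatively slow and are generally meant for plugging in functions from third-party packages. Whenever possible, we should rely on the expressions Polars itself provides to do the computation, for example: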

(
    df4.with_columns(
        pl.sum_horizontal("first", "second", "third").alias("sum")
    )
)

shape: (3, 4)
┌───────┬────────┬───────┬─────┐
│ first ┆ second ┆ third ┆ sum │
│ ---   ┆ ---    ┆ ---   ┆ --- │
│ u64   ┆ u64    ┆ u64   ┆ u64 │
╞═══════╪════════╪═══════╪═════╡
│ 5     ┆ 15     ┆ 15    ┆ 35  │
│ 5     ┆ 14     ┆ 6     ┆ 25  │
│ 13    ┆ 18     ┆ 5     ┆ 36  │
└───────┴────────┴───────┴─────┘

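When a third-party function is unavoidable, first check whether NumPy provides one. If it does not, consider writing a custom function with Numba. This is more advanced material, so interested readers can consult the tutorial documentation.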

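4. Summary

Between pl.Array and pl.List, prefer pl.Array because it uses less memory and performs better, unless the operation you need is missing from the arr namespace. Also keep in mind that the arr namespace provides pl.Expr.arr.to_list() and pl.Expr.arr.to_struct(), which convert a pl.Array into a pl.List or a pl.Struct.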